Tagging the Dutch PAROLE Corpus
Abstract
We discuss the annotation of the Dutch PAROLE Internet Corpus with part of speech and lemma. The PAROLE PoS tagger is a combination of statistical taggers. It includes the Markov tagger TnT and three taggers developed at the INL, which were designed to use information beyond the training data. Lemmas are assigned by a deterministic procedure based on an extensive lexicon. The output is in some respects not entirely satisfactory; we discuss what can be done about this without manually correcting the complete corpus.
Similar resources
Implementation and Evaluation of PAROLE PoS in a National Context
We are annotating the complete 20 million Dutch PAROLE corpus with PoS and lemma. The morphosyntactic tagging of 250,000 words during the PAROLE project was the first confrontation of the fine-grained Dutch PAROLE tagset and its ’functional’ mode of application, with real corpus data. The correction of the manual tagging and the compilation of a 100,000 words training corpus for the automatic t...
Putting the Dutch PAROLE Corpus to Work
We discuss the activities towards the development of the retrieval application of the Dutch PAROLE Corpus. Compared to the other corpora developed by INL, the PAROLE Corpus has been encoded with more extended types of metadata, conformant to the TEI standard for text encoding. A search engine and a web-based user interface, both newly developed by INL, provide the user with the functionality to...
POS-tagging of Historical Dutch
We present a study of the adequacy of current methods that are used for POS-tagging historical Dutch texts, as well as an exploration of the influence of employing different techniques to improve upon the current practice. The main focus of this paper is on (unsupervised) methods that are easily adaptable for different domains without requiring extensive manual input. It was found that modernis...
From D-Coi to SoNaR: a reference corpus for Dutch
The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi ...
Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus
This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The r...
Publication date: 2001